This tutorial demonstrates a (semi-)automated method for downloading
grey literature, using Technology Appraisal documents from the National
Institute for Health and Care Excellence (NICE) as an example.
The same web-scraping approach can be used to download documents
where patterns in PDF/HTML URLs can be identified across various types
of grey literature (i.e., not limited to NICE documents). For exceptions
that don’t follow regular patterns, URLs can be extracted from the page
source of static web pages. While example code for this second approach
is provided in the appendix, a more detailed tutorial may be included in
a future update.
Once the files are downloaded, manual checks will still be required
during the literature review process. However, this method will save
significant time compared to downloading each document
individually.
For simplicity, this tutorial uses the systematic download of 10 TAs
as an example, but it can easily be applied to download hundreds of TAs
at once.
This tutorial is based on code originally developed for a systematic
review conducted in 2020 (see Appendix). The motivation for creating
this code was the lack of platforms capable of systematically
downloading grey literature for review purposes. The code was initially
designed to screen/review 460 NICE Technology Appraisals (TAs) related
to health technology assessments (HTAs) that discuss treatment
sequences.
Conference presentations on this topic are
available below, and a journal publication is in progress.
Chang JYA, Latimer NR, Gillespie D, Chilcott J. Prevalence, Characteristics, and Key Issues of Modelling Treatment Sequences in Health Economic Evaluations. Virtual ISPOR Europe 2020, Nov 16-19, 2020. [Poster Presentation]
Chang JYA, Chilcott JB, Latimer NR. Exploring Data-Driven Challenges in Modelling the Effectiveness of Treatment Sequences in Health Economic Evaluations. The 15th IHEA World Congress on Health Economics, July 8-12, 2023. [Oral Presentation]
Full results of the systematic review can be found in Chapter 3 of my
PhD thesis.
Now, let’s begin with the method.
First, navigate to the NICE website. Identify the indices of TAs (i.e., TA numbers) that you are interested in reviewing.
Here, I use Type
2 diabetes TAs as an example. There are 10 relevant appraisals, as
shown in the screenshot below, including one that was terminated:
TA1006 (terminated), TA924, TA877, TA583, TA572, TA288, TA418, TA390,
TA336, TA315
One TA (TA1006) was terminated, so a vector of the indices for the remaining 9 TAs of interest is created:
# Create a vector containing indices of TAs that you are interested in reviewing
TA_vector <- c(924, 877, 583, 572, 288, 418, 390, 336, 315)
# Sort the TA_vector in descending order (if it's not already sorted)
TA_vector <- sort(TA_vector, decreasing = TRUE)
Note: Alternatively, you can download the TA
recommendation list from this NICE webpage and clean it in R or
Excel (e.g., for title screening). You can then import the list into R.
This approach is particularly useful if there are many entries (e.g.,
more than 20 TAs) (see example in Appendix). The NICE TA recommendation
list includes removed and terminated TAs, while the active
TA search list may not always show those that have been replaced or
terminated.
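For instance, TA numbers can be pulled out of the exported titles with a regular expression. This is a minimal sketch: the helper name, the file name, and the Title column are hypothetical and should be adjusted to the file you actually export from the NICE website.

```r
# Hypothetical helper: extract the numeric TA index from a guidance title string
extract_ta <- function(x) as.numeric(sub(".*TA(\\d+).*", "\\1", x))

extract_ta("TA877 Finerenone for treating chronic kidney disease")

# With an exported recommendation list (file/column names are assumptions),
# the whole vector can then be built in one step:
# ta_list   <- read.csv("TA_recommendations.csv", stringsAsFactors = FALSE)
# TA_vector <- sort(extract_ta(ta_list$Title), decreasing = TRUE)
```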
In this example, the aim is to download and review three key
documents for each TA:
1. Company/Manufacturer Submission report (CS)
2. Evidence Review Group (ERG) Report / Assessment Group (AG) Report / External Assessment Group (EAG) Report
3. Final Appraisal Determination (FAD)
For illustration, let’s use “TA877: Finerenone for treating chronic kidney disease in type 2 diabetes” as an example. On the document history page of TA877, you can find all the documents related to this appraisal (see screenshot below).
In recent TAs (i.e., approximately after TA350), the initial CS
report and ERG/AG/EAG report are typically included in the committee
papers (CP) under the draft guidance or initial
consultation section, while the final FAD is usually found in the
final draft guidance section (see the red highlighted boxes in
the screenshot above).
Note 1: Multiple versions of these documents may exist if the appraisal went through multiple stages, such as an appeal. In this tutorial, we will use slight variations in the URL to download both the initial and subsequent versions of the CS and ERG/EAG/AG documents within a TA.
Note 2: Handling earlier TAs, which may require different URL structures for CS and ERG/EAG/AG documents, is not covered here but is included in the code in the Appendix for reference.
The screenshot below shows the content list and URL (green
highlighted box) for the pdf of the committee papers under the draft
guidance of TA877, as previously mentioned. The red boxes highlight the
locations of the CS report and the EAG report.
The URL pattern for CPs of different TAs typically follows the
following structure:
https://www.nice.org.uk/guidance/ta877/documents/committee-papers
That is, to access the CPs for other TAs, simply replace “877” with the relevant TA index.
The screenshot below shows the PDF of the FAD in the final draft
guidance of TA877 and its URL (green highlighted box).
The URL pattern for the final FAD of different TAs typically follows
the following structure:
https://www.nice.org.uk/guidance/ta877/documents/final-appraisal-determination-document
That is, to access the FAD for other TAs, simply replace “877” with the relevant TA index.
Now that we have the URL pattern, we can automate the download of
multiple PDFs at once. The steps are as follows:
1. Build a download function: We’ll create a function that automates the download of PDFs based on a list of URLs.
2. Prepare URL lists and bulk download the PDFs: Next, we’ll create separate lists of URLs for each document type (i.e., CPs, FADs) and use the function to download the PDFs from each list.
3. Track downloads: Create files to track which documents were successfully downloaded and which failed within each TA.
Below is the function we’ll use to automate this process:
# Create a function for downloading files; returns a data frame showing the download status for each bulk download
file_download <- function(TA_number, url, file_name) {
  error_vec <- rep("NA", times = length(TA_number))    # Initialise error status
  download_status <- data.frame(TA = TA_number,        # Create status tracking data frame
                                Status = as.character(error_vec))
  for (n in 1:length(TA_number)) {
    tryCatch({
      download.file(url[n], destfile = file_name[n], mode = "wb") # Download file
      download_status$Status[n] <- "downloaded" # Mark as downloaded if successful
    }, error = function(e) {
      # Silently handle the error for the purpose of this tutorial
      NULL # See the Appendix for how this can be customised
    })
  }
  download_status[download_status$Status != "downloaded", "Status"] <- "NA"
  return(download_status) # Return the download status data frame
}
This function automates the download process, making it easy to
download multiple PDFs by simply providing a list of URLs. In the
following sections, we’ll create lists of URLs for different document
types and use this file_download function to download them
efficiently.
For CPs (which include CS and ERG/EAG/AG documents), here’s how to
create a list of all 9 TAs from Section 3 in a vector and download them
together using the file_download function:
# Based on the aforementioned URL patterns, generate URLs and file names for all 9 type 2 diabetes TAs using the TA_vector and download them together using the file_download function
url <- paste("https://www.nice.org.uk/guidance/ta", TA_vector, "/documents/committee-papers", sep = "")
file_name <- paste("TA", TA_vector, "_CP.pdf", sep = "") # Add a "_CP" suffix to each file name
# Use the file_download function to download all committee papers
Fulltext_CP <- file_download(TA_number = TA_vector, url = url, file_name = file_name)
# Check how many files were successfully downloaded
length(Fulltext_CP[Fulltext_CP$Status == "downloaded", "TA"])
## [1] 6
# Show the download status (printing the full list is practical only for small lists)
Fulltext_CP
## TA Status
## 1 924 downloaded
## 2 877 downloaded
## 3 583 downloaded
## 4 572 downloaded
## 5 418 downloaded
## 6 390 downloaded
## 7 336 NA
## 8 315 NA
## 9 288 NA
The first bulk download attempt using the identified CP URL
pattern successfully retrieved 6 out of 9 CP reports, as indicated by
the “Status” in the download tracking list (i.e.,
Fulltext_CP). According to the tracking list, TA336, TA315,
and TA288 could not be downloaded using the same URL pattern (i.e.,
displaying NA). These may need to be downloaded manually or
by finding alternative URL patterns using the method outlined in the
Appendix.
For now, we can attempt to download subsequent (i.e., non-initial) CP
documents by using the same method to identify URLs, experimenting with
different variations. In some cases, a higher suffix (e.g., -2) may
actually correspond to the initial reports, which can be confirmed upon
reviewing the file content. The goal is to download all available
versions of the CP documents wherever possible.
# CPs with a URL ending in -2
url <- paste("https://www.nice.org.uk/guidance/ta", TA_vector, "/documents/committee-papers-2", sep = "")
file_name <- paste("TA", TA_vector, "_CP2.pdf", sep = "")
Fulltext_CP2 <- file_download(TA_number = TA_vector, url = url, file_name = file_name)
length(Fulltext_CP2[Fulltext_CP2$Status == "downloaded","TA"])
## [1] 3
Fulltext_CP$Status2 <- Fulltext_CP2$Status # Amend status in the main download tracking list
Fulltext_CP
## TA Status Status2
## 1 924 downloaded NA
## 2 877 downloaded downloaded
## 3 583 downloaded NA
## 4 572 downloaded NA
## 5 418 downloaded downloaded
## 6 390 downloaded downloaded
## 7 336 NA NA
## 8 315 NA NA
## 9 288 NA NA
For CP documents with a URL ending in -2, three additional reports (from TA877, TA418, and TA390) were downloaded, as indicated by “Status2” in the CP download tracking list.
# CPs with a URL ending in -3
url <- paste("https://www.nice.org.uk/guidance/ta", TA_vector, "/documents/committee-papers-3", sep = "")
file_name <- paste("TA", TA_vector, "_CP3.pdf", sep = "")
Fulltext_CP3 <- file_download(TA_number = TA_vector, url = url, file_name = file_name)
length(Fulltext_CP3[Fulltext_CP3$Status == "downloaded","TA"])
## [1] 2
Fulltext_CP$Status3 <- Fulltext_CP3$Status # Amend status in the main download tracking list
Fulltext_CP
## TA Status Status2 Status3
## 1 924 downloaded NA downloaded
## 2 877 downloaded downloaded NA
## 3 583 downloaded NA NA
## 4 572 downloaded NA NA
## 5 418 downloaded downloaded downloaded
## 6 390 downloaded downloaded NA
## 7 336 NA NA NA
## 8 315 NA NA NA
## 9 288 NA NA NA
For CP documents with a URL ending in -3, two additional reports (from TA924 and TA418) were downloaded, as indicated by “Status3” in the CP download tracking list.
# CPs with a URL ending in -4
url <- paste("https://www.nice.org.uk/guidance/ta", TA_vector, "/documents/committee-papers-4", sep = "")
file_name <- paste("TA", TA_vector, "_CP4.pdf", sep = "")
Fulltext_CP4 <- file_download(TA_number = TA_vector, url = url, file_name = file_name)
length(Fulltext_CP4[Fulltext_CP4$Status == "downloaded","TA"])
## [1] 0
Fulltext_CP$Status4 <- Fulltext_CP4$Status # Amend status in the main download tracking list
Fulltext_CP
## TA Status Status2 Status3 Status4
## 1 924 downloaded NA downloaded NA
## 2 877 downloaded downloaded NA NA
## 3 583 downloaded NA NA NA
## 4 572 downloaded NA NA NA
## 5 418 downloaded downloaded downloaded NA
## 6 390 downloaded downloaded NA NA
## 7 336 NA NA NA NA
## 8 315 NA NA NA NA
## 9 288 NA NA NA NA
From the CP download tracking list (i.e., Fulltext_CP), we know that no additional downloads were successful for CP documents with a URL variation ending in -4, as indicated by “Status4”. This suggests that the highest valid URL suffix for this set of TAs is -3. In real cases, further variations (e.g., ending in -5) could still be tested, but for this tutorial example we can stop here.
As for TA336, TA315, and TA288, no CP documents were downloadable using similar URL patterns. These documents can either be downloaded manually or via a more exhaustive web-scraping approach that extracts all available URLs from the page source of the document history page (as shown in the Appendix).
This tutorial may be updated in the future to include a detailed
explanation of this part of the code. Note that documents with very
different URL patterns are generally from TAs approximately before
TA350, as earlier appraisals often uploaded CS documents separately from
ERG/AG reports. Therefore, the more complex approach is less likely
needed when reviewing only TAs numbered above 350.
The screenshot below shows that all available CP documents were
downloaded to the project folder within 1-2 minutes.
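Since a failed or partial request can still leave a small or empty file on disk, a quick file-size check is a useful sanity check after a bulk download. The helper below is a hypothetical addition (not part of the original method), and the 10 KB threshold is an arbitrary assumption to tune for your documents:

```r
# Sketch: flag suspiciously small PDFs that may be broken downloads.
# The 10 KB threshold is an arbitrary assumption; tune it for your documents.
flag_small_pdfs <- function(dir = ".", min_bytes = 10 * 1024) {
  pdfs  <- list.files(dir, pattern = "\\.pdf$", full.names = TRUE)
  sizes <- file.size(pdfs)            # File sizes in bytes
  basename(pdfs[sizes < min_bytes])   # Names of files below the threshold
}

# flag_small_pdfs()  # e.g. run in the folder holding the TA downloads
```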
Similarly, for FADs, here’s how to create a list of FAD-related URLs for all 9 TAs from Section 3 and download them using the file_download function:
url <- paste("https://www.nice.org.uk/guidance/ta",TA_vector,"/documents/final-appraisal-determination-document", sep = "")
file_name <- paste("TA", TA_vector, "_FAD.pdf", sep = "")
Fulltext_FAD <- file_download(TA_number = TA_vector, url = url, file_name = file_name)
# Check how many files failed to download
length(Fulltext_FAD[Fulltext_FAD$Status == "NA","TA"])
## [1] 4
Fulltext_FAD
## TA Status
## 1 924 NA
## 2 877 downloaded
## 3 583 downloaded
## 4 572 downloaded
## 5 418 downloaded
## 6 390 downloaded
## 7 336 NA
## 8 315 NA
## 9 288 NA
The first bulk download attempt using the identified FAD URL pattern retrieved 5 out of 9 FAD reports, as indicated by “Status” in the FAD download tracking list. According to the tracking list, TA924, TA336, TA315, and TA288 could not be downloaded using the same URL pattern. These may need to be downloaded manually or by finding alternative URL patterns using the method outlined in the Appendix.
For now, we can attempt to download subsequent (i.e., non-final) FAD
documents by using the same method to identify URLs, experimenting with
different variations. In some cases, a higher suffix (e.g., -2) may
actually correspond to the final reports (especially in TA appeals),
which can be confirmed upon reviewing the file content. The goal is to
download all available versions of the FAD documents wherever
possible.
# url with FAD document 2
url <- paste("https://www.nice.org.uk/guidance/ta",TA_vector,"/documents/final-appraisal-determination-document-2", sep = "")
file_name <- paste("TA", TA_vector, "_FAD2.pdf", sep = "")
Fulltext_FAD2 <- file_download(TA_number = TA_vector, url = url, file_name = file_name)
Fulltext_FAD2[Fulltext_FAD2$Status == "downloaded", "TA"]
## numeric(0)
Fulltext_FAD$Status2 <- Fulltext_FAD2$Status
Fulltext_FAD
## TA Status Status2
## 1 924 NA NA
## 2 877 downloaded NA
## 3 583 downloaded NA
## 4 572 downloaded NA
## 5 418 downloaded NA
## 6 390 downloaded NA
## 7 336 NA NA
## 8 315 NA NA
## 9 288 NA NA
For FAD documents with a URL variation ending in -2, no additional downloads were successful, as indicated by “Status2” in the updated FAD download tracking list. This suggests that, for this set of TAs, only the base FAD URL (without a numeric suffix) is valid. In real cases, further variations (e.g., ending in -3) could still be tested, but for this tutorial example we can stop here.
As for TA924, TA336, TA315, and TA288, no FAD documents were downloadable using similar URL patterns. These documents can either be downloaded manually or via a more exhaustive web-scraping approach that extracts all available URLs from the page source of the document history page (as shown in the Appendix). This tutorial may be updated in the future to include a detailed explanation of this part of the code. Note that these documents are typically from TAs approximately before TA350, so the more complicated alternative approach is less likely to be needed when reviewing TAs numbered above 350.
Create files to store the TA titles (scraped from the NICE website)
and track the downloads for CP and FAD documents. These tracking lists
can serve as a guide for identifying which TAs may need further manual
downloads.
# Load the web-scraping packages used below
library(xml2)  # read_html
library(rvest) # html_nodes, html_text
# Create the URL of the document history page for each TA
url <- paste("https://www.nice.org.uk/guidance/ta", TA_vector, "/history", sep = "")
# Create a data frame for storage
title <- data.frame(TA = TA_vector,
                    title = rep(NA, times = length(TA_vector)))
for (n in 1:length(TA_vector)) {
  tryCatch({
    # Read the HTML of the document history page
    webpage <- read_html(url[n])
    # Use CSS selectors to scrape the title section
    title_data_html <- html_nodes(webpage, '#content-start')
    title$title[n] <- html_text(title_data_html)
  }, error = function(e) {
    cat("ERROR :", conditionMessage(e), "\n")
  })
  # Random sleeping time: the website may ban your IP if too many queries are sent too quickly
  sleepy <- sample(c(0.5, 1.5, 2.5), 1)
  cat("\n let's just wait for", sleepy, "seconds...")
  Sys.sleep(sleepy)
}
# Clean the scraped titles
title$title <- gsub("\\n", "", title$title)
title$title <- trimws(title$title)
##
## let's just wait for 0.5 seconds...
## let's just wait for 2.5 seconds...
## let's just wait for 2.5 seconds...
## let's just wait for 2.5 seconds...
## let's just wait for 2.5 seconds...
## let's just wait for 2.5 seconds...
## let's just wait for 2.5 seconds...
## let's just wait for 0.5 seconds...
## let's just wait for 2.5 seconds...
# Store title
write.csv(title, "Title.csv")
# Store FAD & CP download tracking list
write.csv(Fulltext_CP, "Fulltext_CP.csv")
write.csv(Fulltext_FAD, "Fulltext_FAD.csv")
Note: The waiting messages indicate pauses between attempts to scrape the NICE website; sending queries too quickly can result in a temporary IP ban.
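If downloads fail intermittently (e.g., due to rate limiting), one option is to wrap download.file in a small retry helper that waits longer after each failed attempt. This is a hypothetical extension, not part of this tutorial's file_download function:

```r
# Sketch: retry a download a few times, pausing longer after each failure.
# Hypothetical helper; the tutorial's file_download() does not retry.
download_with_retry <- function(url, destfile, tries = 3, base_wait = 2) {
  for (attempt in 1:tries) {
    ok <- tryCatch({
      download.file(url, destfile = destfile, mode = "wb")
      TRUE
    }, error = function(e) FALSE)
    if (ok) return(TRUE)               # Stop as soon as one attempt succeeds
    Sys.sleep(base_wait * attempt)     # Wait longer after each failed attempt
  }
  FALSE                                # All attempts failed
}
```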
Below is an example of how the file that records the title of each TA
(Title.csv) appears:
Here’s a snapshot of the CP download tracking list
(Fulltext_CP.csv):
And here is the FAD download tracking list
(Fulltext_FAD.csv):
These lists help track which files were automatically downloaded and
identify where manual downloads may be necessary.
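The separate files can also be combined into a single overview table keyed on the TA number, which makes the manual-check pass easier. The merge_tracking helper below is a hypothetical convenience wrapper around base R's merge; overlapping status columns are distinguished with _CP and _FAD suffixes:

```r
# Hypothetical helper: merge the scraped titles with the CP and FAD
# tracking lists by TA number into one overview table
merge_tracking <- function(title, cp, fad) {
  overview <- merge(title, cp, by = "TA", all = TRUE)
  merge(overview, fad, by = "TA", all = TRUE, suffixes = c("_CP", "_FAD"))
}

# In this tutorial, the inputs would be the objects created above:
# overview <- merge_tracking(title, Fulltext_CP, Fulltext_FAD)
# write.csv(overview, "Download_overview.csv", row.names = FALSE)
```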
Below is an
example of the manually updated download tracking list (CP and FAD) from
the aforementioned systematic
review, following verification of the downloaded files. The red
highlighted boxes indicate where manual downloads were necessary.
For
alternative methods to automate downloads when none of the documents
within a TA successfully download using the usual URL patterns (i.e.,
all documents for TA336, TA315, and TA288, and the FAD document for
TA924), refer to the code in the Appendix. :)
Finally, here is
a screenshot of all the files downloaded and saved during this
tutorial.
The following is the original code from our 2020 systematic review of treatment sequences in health technology appraisals, designed to explore their prevalence, characteristics, and key issues in modelling treatment sequences within health economic evaluations.
The list of reviewed TAs required to initialize the code is available in the GitHub repository as TA_list_20191201.csv.
Disclaimer: Please note that the code was developed before the era of large language models (LLMs) and may not be highly efficient (and yes there might be typos! Please get in contact if you spot any, thanks!), but it served its purpose at the time! ;)
########################################################################
# Project: Review NICE TA regarding treatment sequencing problem
# Automate the process of downloading TA documents
# Manually downloading e.g. 60 files can take up to 2 hours
# and may lead to mistakes due to incorrect manual entries.
# Create: JY Amy Chang
# Date: 02Mar2020
########################################################################
# Download pdf files
install.packages("XML")
install.packages("bitops")
install.packages("RCurl")
install.packages("httr")
install.packages("xml2")
install.packages("rvest")
install.packages("stringr")
install.packages("truncnorm")
install.packages("pagedown")
library(bitops)
library(RCurl)
library(XML)
library(httr)
library(xml2)
library(rvest)
library(stringr)
library(pagedown)
library(truncnorm)
# import active TA list for review
TA_list <- read.csv(file = "raw/TA_list_20191201.csv", header = FALSE)
TA_vector <- as.vector(TA_list[,"V1"])
# transform TA list into vectors for creating url and file name for bulk download
# sort vector to download from latest to oldest
TA_vector <- sort(TA_vector, decreasing = T)
# create a function for downloading files, return data frame error_vec that indicates which files are not downloaded
file_download <- function(TA_number, url, file_name) {
  error_vec <- rep("NA", times = length(TA_number)) # length 460 in the original review
  download_status <- data.frame(TA = TA_number,
                                Status = error_vec)
  download_status$Status <- as.character(download_status$Status)
  for (n in 1:length(TA_number)) {
    tryCatch({
      download.file(url[n], destfile = file_name[n], mode = "wb")
      download_status$Status[n] <- "downloaded"
    }, error = function(e){
      cat("ERROR :", conditionMessage(e), "\n")
    })
  }
  download_status[download_status$Status != "downloaded", "Status"] <- "NA"
  return(download_status)
}
###########################################
# FAD (Final Appraisal Determination)
###########################################
# download FAD # not all files use the same URL logic
url <- paste("https://www.nice.org.uk/guidance/ta",TA_vector,"/documents/final-appraisal-determination-document", sep = "")
file_name <- paste("TA", TA_vector, "_FAD.pdf", sep = "")
Fulltext_FAD <- file_download(TA_number = TA_vector, url = url, file_name = file_name)
length(Fulltext_FAD[Fulltext_FAD$Status == "NA","TA"]) # 259 undownloaded
write.csv(Fulltext_FAD, "Fulltext_FAD.csv")
# url with document 2
url <- paste("https://www.nice.org.uk/guidance/ta",TA_vector,"/documents/final-appraisal-determination-document-2", sep = "")
file_name <- paste("TA", TA_vector, "_FAD2.pdf", sep = "")
Fulltext_FAD2 <- file_download(TA_number = TA_vector, url = url, file_name = file_name)
Fulltext_FAD2[Fulltext_FAD2$Status == "downloaded", "TA"] # 42 downloaded
Fulltext_FAD$Status2 <- Fulltext_FAD2$Status
# test if there is document 1
# url <- paste("https://www.nice.org.uk/guidance/ta",TA_vector,"/documents/final-appraisal-determination-document-1", sep = "")
# file_name <- paste("TA", TA_vector, "_FAD1.pdf", sep = "")
# Fulltext_FAD1 <- file_download(TA_number = TA_vector, url = url, file_name = file_name)
# Fulltext_FAD1[Fulltext_FAD1$Status == "downloaded", "TA"] # 0 downloaded
# url with document 3
url <- paste("https://www.nice.org.uk/guidance/ta",TA_vector,"/documents/final-appraisal-determination-document-3", sep = "")
file_name <- paste("TA", TA_vector, "_FAD3.pdf", sep = "")
Fulltext_FAD3 <- file_download(TA_number = TA_vector, url = url, file_name = file_name)
Fulltext_FAD3[Fulltext_FAD3$Status == "downloaded", "TA"] # 9 downloaded
Fulltext_FAD$Status3 <- Fulltext_FAD3$Status
# url with document 4
url <- paste("https://www.nice.org.uk/guidance/ta",TA_vector,"/documents/final-appraisal-determination-document-4", sep = "")
file_name <- paste("TA", TA_vector, "_FAD4.pdf", sep = "")
Fulltext_FAD4 <- file_download(TA_number = TA_vector, url = url, file_name = file_name)
Fulltext_FAD4[Fulltext_FAD4$Status == "downloaded", "TA"] # 2 downloaded
Fulltext_FAD$Status4 <- Fulltext_FAD4$Status #[1] 491 487 these two has more documents due to CDF and managed access
#############################
# CP (Committee Paper)
#############################
# Committee papers are more complicated than FADs, as the first committee paper is usually the consultation document
# CP document 1
url <- paste("https://www.nice.org.uk/guidance/ta", TA_vector, "/documents/committee-papers", sep = "")
file_name <- paste("TA", TA_vector, "_CP.pdf", sep = "")
Fulltext_CP <- file_download(TA_number = TA_vector, url = url, file_name = file_name)
length(Fulltext_CP[Fulltext_CP$Status == "downloaded","TA"]) # 204 downloaded
#write.csv(Fulltext_CP, "Fulltext_CP.csv")
# CP document 2
url <- paste("https://www.nice.org.uk/guidance/ta", TA_vector, "/documents/committee-papers-2", sep = "")
file_name <- paste("TA", TA_vector, "_CP2.pdf", sep = "")
Fulltext_CP2 <- file_download(TA_number = TA_vector, url = url, file_name = file_name)
length(Fulltext_CP2[Fulltext_CP2$Status == "downloaded","TA"]) # 136 downloaded
Fulltext_CP$Status2 <- Fulltext_CP2$Status
# CP document 3
url <- paste("https://www.nice.org.uk/guidance/ta", TA_vector, "/documents/committee-papers-3", sep = "")
file_name <- paste("TA", TA_vector, "_CP3.pdf", sep = "")
Fulltext_CP3 <- file_download(TA_number = TA_vector, url = url, file_name = file_name)
length(Fulltext_CP3[Fulltext_CP3$Status == "downloaded","TA"]) # 63 downloaded
Fulltext_CP$Status3 <- Fulltext_CP3$Status
# CP document 4 (sometimes it can be just slides)
url <- paste("https://www.nice.org.uk/guidance/ta", TA_vector, "/documents/committee-papers-4", sep = "")
file_name <- paste("TA", TA_vector, "_CP4.pdf", sep = "")
Fulltext_CP4 <- file_download(TA_number = TA_vector, url = url, file_name = file_name)
length(Fulltext_CP4[Fulltext_CP4$Status == "downloaded","TA"]) # 32 downloaded
Fulltext_CP$Status4 <- Fulltext_CP4$Status
# CP document 5 (can be managing access agreement)
url <- paste("https://www.nice.org.uk/guidance/ta", TA_vector, "/documents/committee-papers-5", sep = "")
file_name <- paste("TA", TA_vector, "_CP5.pdf", sep = "")
Fulltext_CP5 <- file_download(TA_number = TA_vector, url = url, file_name = file_name)
Fulltext_CP5[Fulltext_CP5$Status == "downloaded","TA"] # 17 downloaded
# [1] 588 541 510 502 495 491 484 483 479 474 473 472 445 432 423 417 402
Fulltext_CP$Status5 <- Fulltext_CP5$Status
# CP document 6 (can be CDF glossary)
url <- paste("https://www.nice.org.uk/guidance/ta", TA_vector, "/documents/committee-papers-6", sep = "")
file_name <- paste("TA", TA_vector, "_CP6.pdf", sep = "")
Fulltext_CP6 <- file_download(TA_number = TA_vector, url = url, file_name = file_name)
Fulltext_CP6[Fulltext_CP6$Status == "downloaded","TA"] # 7 downloaded
# [1] 510 484 483 479 474 473 402
Fulltext_CP$Status6 <- Fulltext_CP6$Status
# CP document 7 (can be CDF glossary) (for those that had two consultations; this seems to be the maximum)
url <- paste("https://www.nice.org.uk/guidance/ta", TA_vector, "/documents/committee-papers-7", sep = "")
file_name <- paste("TA", TA_vector, "_CP7.pdf", sep = "")
Fulltext_CP7 <- file_download(TA_number = TA_vector, url = url, file_name = file_name)
Fulltext_CP7[Fulltext_CP7$Status == "downloaded","TA"] # 3 downloaded
# [1] 484 474 473
Fulltext_CP$Status7 <- Fulltext_CP7$Status
# CP document 8 (sorafenib, email)
url <- paste("https://www.nice.org.uk/guidance/ta", TA_vector, "/documents/committee-papers-8", sep = "")
file_name <- paste("TA", TA_vector, "_CP8.pdf", sep = "")
Fulltext_CP8 <- file_download(TA_number = TA_vector, url = url, file_name = file_name)
Fulltext_CP8[Fulltext_CP8$Status == "downloaded","TA"] # 2 downloaded
# [1] 474 473
Fulltext_CP$Status8 <- Fulltext_CP8$Status
# CP document 9 (sorafenib, email)
url <- paste("https://www.nice.org.uk/guidance/ta", TA_vector, "/documents/committee-papers-9", sep = "")
file_name <- paste("TA", TA_vector, "_CP9.pdf", sep = "")
Fulltext_CP9 <- file_download(TA_number = TA_vector, url = url, file_name = file_name)
Fulltext_CP9[Fulltext_CP9$Status == "downloaded","TA"] # 2 downloaded
# [1] 474 473
Fulltext_CP$Status9 <- Fulltext_CP9$Status
# CP document 10 (sorafenib, email)
url <- paste("https://www.nice.org.uk/guidance/ta", TA_vector, "/documents/committee-papers-10", sep = "")
file_name <- paste("TA", TA_vector, "_CP10.pdf", sep = "")
Fulltext_CP10 <- file_download(TA_number = TA_vector, url = url, file_name = file_name)
Fulltext_CP10[Fulltext_CP10$Status == "downloaded","TA"] # 1 downloaded
# [1] 473
Fulltext_CP$Status10 <- Fulltext_CP10$Status
# There are further documents for TA473, up to document 12 (not downloaded due to irrelevance)
####################################################################
# FIND TERMINATED appraisals
####################################################################
# SOLVED WITH HELP FROM
# TUTORIAL https://www.analyticsvidhya.com/blog/2017/03/beginners-guide-on-web-scraping-in-r-using-rvest-with-hands-on-knowledge/
# https://selectorgadget.com/
# create url
url <- paste("https://www.nice.org.uk/guidance/ta", TA_vector, "/history", sep = "")
# create dataframe for storage
title <- data.frame(TA = TA_vector,
                    title = rep(NA, times = length(TA_vector)))
for (n in 1:length(TA_vector)) {
  tryCatch({
    # read html
    webpage <- read_html(url[n])
    # Use CSS selectors to scrape the title section
    title_data_html <- html_nodes(webpage, '#content-start')
    title$title[n] <- html_text(title_data_html)
  }, error = function(e){
    cat("ERROR :", conditionMessage(e), "\n")
  })
  # random sleeping time
  sleepy <- sample(c(0.5, 1.5, 2.5), 1)
  cat("\n let's just wait for", sleepy, "seconds...")
  Sys.sleep(sleepy) # website may ban the IP when too many queries are sent too quickly
}
title[title$title == "NA","TA"] # none (all titles are downloaded)
# update FAD & CP document
Fulltext_CP[grep("terminated", title$title), 2:11] <- "terminated"
write.csv(Fulltext_CP, "Fulltext_CP_20200305.csv")
Fulltext_FAD[grep("terminated", title$title), 2:5 ] <- "terminated"
write.csv(Fulltext_FAD, "Fulltext_FAD_20200302.csv")
##############################################
# Create a file to store TA titles
##############################################
# create url
url <- paste("https://www.nice.org.uk/guidance/ta", TA_vector, "/history", sep = "")
# create dataframe for storage
documents <- data.frame(TA = TA_vector,
                        title = rep(NA, times = length(TA_vector)))
for (n in 1:length(TA_vector)) {
  tryCatch({
    # read html
    webpage <- read_html(url[n])
    # Use CSS selectors to scrape the title section
    title_data_html <- html_nodes(webpage, '#content-start')
    documents$title[n] <- html_text(title_data_html) # store into `documents`, which is written out below
  }, error = function(e){
    cat("ERROR :", conditionMessage(e), "\n")
  })
  # random sleeping time
  sleepy <- sample(c(0.5, 1.5, 2.5), 1)
  cat("\n let's just wait for", sleepy, "seconds...")
  Sys.sleep(sleepy) # website may ban the IP when too many queries are sent too quickly
}
write.csv(documents, "Title_20200302.csv")
##############################################
# trying to download earlier FAD & CP
# less regularly named
##############################################
# https://stackoverflow.com/questions/57008774/r-help-me-to-scrap-links-from-webpage
# https://i.stack.imgur.com/bISth.jpg
# the key point is that everything sits under the big <ul class="media-list"> block
base <- 'https://www.nice.org.uk'
url <- paste("https://www.nice.org.uk/guidance/ta", TA_vector, "/history", sep = "")
# not using the following code because it produces only the relative link
# links <- read_html(url) %>% html_nodes(., ".media-list a") %>% html_attr(., "href")
# defaults to downloading committee papers (the ".media-list a" selector in the function may need changing if documents are not stored there)
link_download <- function(TA_number = 1:length(TA_vector), url = url, search_term = "committee-papers") {
# create vacant table for storing links
links_df <- data.frame(TA = TA_vector,
link = rep(NA, times = length(TA_vector)),
link2 = rep(NA, times = length(TA_vector)),
link3 = rep(NA, times = length(TA_vector)),
link4 = rep(NA, times = length(TA_vector)),
link5 = rep(NA, times = length(TA_vector)),
link6 = rep(NA, times = length(TA_vector)),
link7 = rep(NA, times = length(TA_vector)),
link8 = rep(NA, times = length(TA_vector)),
link9 = rep(NA, times = length(TA_vector)),
link10 = rep(NA, times = length(TA_vector)))
for(n in c(TA_number)){
tryCatch({
# output links of
links <- url_absolute(read_html(url[n]) %>% html_nodes(., ".media-list a") %>% html_attr(., "href"), base)
links_df_temp <- links[grep(search_term, links)]
links_df[n, 2:(length(links_df_temp)+1)] <- t(links_df_temp)
}, error = function(e){
cat("ERROR :",conditionMessage(e), "\n")
})
# random sleeping time
sleepy = rtruncnorm(n = 1, a = 0.000001, mean = 0.8, sd = 0.3)
cat("\n let's just wait for",sleepy,"seconds...")
Sys.sleep(sleepy) # website will ban IP when too many queries are sent too quickly...
}
return(links_df)
}
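# A small demo (hypothetical link vector) of the grep-based filtering used
# inside link_download(): only links containing search_term are kept.
demo_links <- c("https://www.nice.org.uk/guidance/ta1/documents/committee-papers",
"https://www.nice.org.uk/guidance/ta1/documents/final-appraisal-determination")
demo_links[grep("committee-papers", demo_links)]
# [1] "https://www.nice.org.uk/guidance/ta1/documents/committee-papers"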
title <- read.csv(file = "output/Title_20200302.csv", header = TRUE)
title$X <- NULL
TA_vector_terminated <- grep("terminated", title$title) # indices of terminated appraisals
# download FAD links (FAD links normally point to a PDF; an HTML FAD overview will not have "final-appraisal-determination" in the link)
FAD_links <- link_download(TA_number = 1:length(TA_vector), url = url, search_term = "final-appraisal-determination")
summary(as.factor(rowSums(!is.na(FAD_links[ , 2:22])))) # 61 TAs have no links (compare with the number of terminated TAs)
# 0 1 2 3 4 5 9 12 21
# 61 184 185 11 11 4 2 1 1
# keep up to 5 links; manually download TAs with more than 5 links or with links that were not correctly downloaded
FAD_links[ , 7:22] <- NULL
# download function taking a matrix of links
# note: relies on a `download_status` matrix already existing in the global environment
file_download2 <- function(TA_number = c(1:length(TA_vector)), url = FAD_links, file_name){
for (n in c(TA_number)) {
url_vector <- na.omit(unlist(url[n, 2:length(url[1,])]))
if (length(url_vector) != 0) {
for (k in 1:length(url_vector)) {
tryCatch({
download.file(url_vector[k], destfile = file_name[n, k], mode="wb")
download_status[n, k + 1] <- "downloaded"
}, error = function(e){
cat("ERROR :",conditionMessage(e), "\n")})
}
}
# random sleeping time
sleepy = round(rtruncnorm(n = 1, a = 0.000001, mean = 0.8, sd = 0.3), 1)
cat("\n let's just wait for",round(sleepy, 1),"seconds...")
Sys.sleep(sleepy) # website will ban IP when too many queries are sent too quickly...
}
download_status <- data.frame(TA = download_status[ , 1],
Status = download_status[ , 2:length(url[1,])])
return(download_status)
}
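# Note: file_download2() updates the `download_status` matrix from the global
# environment, so it must be (re)initialised before each call, e.g.:
# download_status <- as.matrix(FAD_temp)   # as done below for FAD, CP and ERG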
# find which files need to be re-downloaded (TAs with no downloaded file at all)
FAD <- read.csv(file = "output/Fulltext_FAD_20200302.csv", header = TRUE)
FAD$X <- NULL
# initial file download of FAD
# use FAD as base case and add one more column
FAD_temp <- FAD
FAD_temp$Status5 <- NA
download_status <- as.matrix(FAD_temp)
file_name <- sapply(1:5, function(i) paste0("TA", TA_vector, "_FAD", i, ".pdf"))
# Download FAD file with links # starting from TA404 there is no FAD (index 204)
TA_vector_index <- 1:length(TA_vector)
Fulltext_FAD_amend <- file_download2(TA_number = TA_vector_index[rowSums(!is.na(FAD[ , 2:5])) == 0],
url = FAD_links, file_name)
#######################################################
# create exclusion dataframe
# (based on fetch FAD link result)
#######################################################
# output TAnumber where there is no FAD and not terminated
`%notin%` <- Negate(`%in%`)
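# quick check of the %notin% helper:
5 %notin% c(1, 2, 3) # TRUE
2 %notin% c(1, 2, 3) # FALSE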
No_FAD_link <- FAD_links$TA[rowSums(!is.na(FAD_links[ , 2:6])) == 0] # columns were trimmed to 5 links above
No_FAD_link[No_FAD_link %notin% TA_vector[TA_vector_terminated]]
length(FAD_links$TA[rowSums(!is.na(FAD_links[ , 2:6])) == 0]) #18
# [1] *532(withdrawn no longer on market)
# *493(replaced by NG616)
# *459(withdrawn)
# 404(outlier "fad-document", suggest manual download)
# *394(review: no FAD, no ERG)
# *381(replaced by TA620 in Jan2020)
# 366(wrongly named, "appraisal-consultation-document", suggest manual download)
# 292(wrongly named, "final-appraisal-determintation-document2", manual download)
# 266(wrongly named, "final-appraisal-determinaton")
# 264(wrongly named, "final-appriasal-determination", manual download)
# 55(no FAD, but assessment report)
# 38(no FAD, but HTA report)
# 34(no FAD, but Assessment report)
# 29(no FAD, but Assessment report)
# 23(no FAD, but Assessment report)
# 20(no FAD, but Assessment report)
# 10(no FAD, but Assessment report)
# *1(review: no FAD, no ERG)
TA_withdrawn <- c(532, 459)
TA_replaced <- c(493, 381)
TA_noCSFADERG <- c(394, 1)
#43 (43 terminated + 6)
test <- Fulltext_FAD_amend
test[2:6] <- sapply(test[2:6], as.character)
test[TA_vector_terminated, 2:6] <- "terminated"
test[TA_vector %in% TA_withdrawn, 2:6] <- "withdrawn"
test[TA_vector %in% TA_replaced, 2:6] <- "replaced"
test[TA_vector %in% TA_noCSFADERG, 2:6] <- "no_CS_FAD_ERG"
Fulltext_FAD_amend <- test
Fulltext_FAD_amend$TA[rowSums(!is.na(Fulltext_FAD_amend[ , 2:6])) == 0] #12
# [1] 404 366 292 266 264 55 38 34 29 23 20 10 (TAs without FAD) see explanation above
# output UPDATED FAD
write.csv(Fulltext_FAD_amend, "Fulltext_FAD_20200305.csv")
write.csv(FAD_links, "Links_FAD_20200305.csv")
### download CP links (some CP links point to an HTML page rather than a PDF)
CP_links <- link_download(TA_number = 1:length(TA_vector), url = url, search_term = "committee-papers")
TA_vector[rowSums(!is.na(CP_links[ , 12:13])) != 0] # TA473 has more links than the columns kept, so some of its documents will not be downloaded; download these manually
# e.g. TA209 evaluation report (the CP naming appears to have changed over time)
ER_links <- link_download(TA_number = 1:length(TA_vector), url = url, search_term = "evaluation-report")
rowSums(!is.na(ER_links[ , 2:10]))
# e.g. TA 192
ERGR_links <- link_download(TA_number = 1:length(TA_vector), url = url, search_term = "erg-report")
rowSums(!is.na(ERGR_links[ , 2:10]))
# e.g. TA 123
ERGR_links2 <- link_download(TA_number = 1:length(TA_vector), url = url, search_term = "evidence-review-group-report")
rowSums(!is.na(ERGR_links2[ , 2:10]))
# TA191
ERGR_links3 <- link_download(TA_number = 1:length(TA_vector), url = url, search_term = "evidence-review-groups-report")
rowSums(!is.na(ERGR_links3[ , 2:10]))
# e.g. TA38
HTA_links <- link_download(TA_number = 1:length(TA_vector), url = url, search_term = "hta-report")
rowSums(!is.na(HTA_links[ , 2:10]))
# e.g. TA75
HTA_links2 <- link_download(TA_number = 1:length(TA_vector), url = url, search_term = "health-technology-assessment")
rowSums(!is.na(HTA_links2[ , 2:10]))
# e.g. TA278, TA61, TA 59
AR_links <- link_download(TA_number = 1:length(TA_vector), url = url, search_term = "assessment-report")
rowSums(!is.na(AR_links[ , 2:10]))
# TA195 really long
# e.g. TA188
AR_links2 <- link_download(TA_number = 1:length(TA_vector), url = url, search_term = "assessment-group-report")
rowSums(!is.na(AR_links2[ , 2:10]))
#
# check how many links (including CP) won't exceed 8 download spaces (the CP_amend data frame can be used to update)
summary(as.factor(rowSums(!is.na(CP_links[ , 2:11])) +
rowSums(!is.na(ER_links[ , 2:11])) +
rowSums(!is.na(ERGR_links[ , 2:11])) +
rowSums(!is.na(ERGR_links2[ , 2:11])) +
rowSums(!is.na(ERGR_links3[ , 2:11])) +
rowSums(!is.na(HTA_links[ , 2:11])) +
rowSums(!is.na(HTA_links2[ , 2:11])) +
rowSums(!is.na(AR_links[ , 2:11])) +
rowSums(!is.na(AR_links2[ , 2:11]))))
# 0 1 2 3 4 5 6 7 8 9 10 11 13
# 50 98 139 67 44 21 15 12 5 3 4 1 1
# links can be stored together
summary(as.factor( rowSums(!is.na(ER_links[ , 2:10])) +
rowSums(!is.na(ERGR_links[ , 2:10])) +
rowSums(!is.na(ERGR_links2[ , 2:10])) +
rowSums(!is.na(ERGR_links3[ , 2:10])) +
rowSums(!is.na(HTA_links[ , 2:10])) +
rowSums(!is.na(HTA_links2[ , 2:10])) +
rowSums(!is.na(AR_links[ , 2:10])) +
rowSums(!is.na(AR_links2[ , 2:10]))))
# 0 1 2 3 4 5 6 7 8 9 10
# 252 58 65 27 24 9 9 7 4 4 1
# 50 TAs still have no CP files at all
no_CP <- TA_vector_index[(rowSums(!is.na(CP_links[ , 2:11])) +
rowSums(!is.na(ER_links[ , 2:11])) +
rowSums(!is.na(ERGR_links[ , 2:11])) +
rowSums(!is.na(ERGR_links2[ , 2:11])) +
rowSums(!is.na(ERGR_links3[ , 2:11])) +
rowSums(!is.na(HTA_links[ , 2:11])) +
rowSums(!is.na(HTA_links2[ , 2:11])) +
rowSums(!is.na(AR_links[ , 2:11])) +
rowSums(!is.na(AR_links2[ , 2:11]))) == 0]
length(no_CP)
no_CP
no_CP <- no_CP[no_CP %notin% TA_vector_terminated] # delete those terminated
no_CP <- no_CP[no_CP %notin% TA_vector_index[TA_vector %in% TA_noCSFADERG]] # NO CSFADERG
no_CP <- no_CP[no_CP %notin% TA_vector_index[TA_vector %in% TA_replaced]] # NO replaced
no_CP <- no_CP[no_CP %notin% TA_vector_index[TA_vector %in% TA_withdrawn]] # exclude withdrawn # 65
no_CP
# [1] 440 447
TA_vector[no_CP]
# [1] 77 (protocol-newer-hypnotic-drugs-for-shortterm-pharmacotherapy-for-insomnia2)
# 64 (report-by-a-consortium)
# find which files need to be re-downloaded (TAs with no downloaded file at all)
CP <- read.csv(file = "output/Fulltext_CP_20200305.csv", header = TRUE)
CP$X <- NULL
CP[TA_vector_terminated , 2:11] <- NA
summary(as.factor(rowSums(!is.na(CP_links[ , 2:13])))) # 204 TAs had no files in the original CP download
# 0 1 2 3 4 5 6 7 9 12
# 229 60 86 41 26 9 4 3 1 1
rowSums(!is.na(CP_links[ , 2:11]))
# keep up to 10 links; manually download TAs whose links were not correctly downloaded
CP_links[ , 12:13] <- NULL # manually download the TAs that had 9 and 12 links
# initial file download of CP
# use CP as the base case
CP_temp <- CP
download_status <- as.matrix(CP_temp)
file_name <- sapply(1:10, function(i) paste0("TA", TA_vector, "_CP", i, ".pdf"))
# Download CP file with links
TA_vector_index <- 1:length(TA_vector)
length(TA_vector_index[rowSums(!is.na(CP[ , 2:11])) == 0]) # 247 indices need to be downloaded via links (including terminated TAs)
Fulltext_CP_amend <- file_download2(TA_number = TA_vector_index[rowSums(!is.na(CP[ , 2:11])) == 0],
url = CP_links, file_name)
length(TA_vector_index[rowSums(!is.na(Fulltext_CP_amend[ , 2:11])) == 0]) # 225 indices still need to be downloaded via links
TA_vector_terminated <- grep("terminated", title$title)
test <- Fulltext_CP_amend
test[2:11] <- sapply(test[2:11], as.character)
test[TA_vector_terminated, 2:11] <- "terminated"
test[TA_vector %in% TA_withdrawn, 2:11] <- "withdrawn"
test[TA_vector %in% TA_replaced, 2:11] <- "replaced"
test[TA_vector %in% TA_noCSFADERG, 2:11] <- "no_CS_FAD_ERG"
Fulltext_CP_amend <- test
length(TA_vector_index[rowSums(!is.na(Fulltext_CP_amend[ , 2:11])) == 0]) # 180 indices still need to be downloaded via links
# output UPDATED CP
write.csv(Fulltext_CP_amend, "Fulltext_CP_20200305.csv")
write.csv(CP_links, "Links_CP_20200305.csv")
#####################################################################
# download the rest of ERG
######################################################################
# create url matrix for download
url_ERG <- matrix(NA, nrow = length(TA_vector), ncol = 11)
url_ERG[ , 1] <- TA_vector
for (n in c(TA_vector_index)){
link_sets <- list(ER_links, ERGR_links, ERGR_links2, ERGR_links3,
HTA_links, HTA_links2, AR_links, AR_links2)
x <- unlist(lapply(link_sets, function(df) {
v <- unname(unlist(df[n, 2:11]))
v[!is.na(v)]
}))
x <- x[seq_len(min(10, length(x)))] # keep at most 10 links
if (length(x) != 0){
url_ERG[n, 2:(length(x)+1)] <- x
}
}
url_ERG <- as.data.frame(url_ERG)
file_name <- sapply(1:10, function(i) paste0("TA", TA_vector, "_ERG", i, ".pdf"))
# Download ERG file with links
download_status[ , 2:11] <- NA
url_ERG[2:11] <- sapply(url_ERG[2:11], as.character)
Fulltext_ERG <- file_download2(TA_number = TA_vector_index,
url = url_ERG, file_name)
length(TA_vector_index[rowSums(!is.na(Fulltext_ERG[ , 2:11])) == 0]) # 252 indices still have no ERG files
TA_vector_terminated <- grep("terminated", title$title)
test <- Fulltext_ERG
test[2:11] <- sapply(test[2:11], as.character)
test[TA_vector_terminated, 2:11] <- "terminated"
test[TA_vector %in% TA_withdrawn, 2:11] <- "withdrawn"
test[TA_vector %in% TA_replaced, 2:11] <- "replaced"
test[TA_vector %in% TA_noCSFADERG, 2:11] <- "no_CS_FAD_ERG"
Fulltext_ERG <- test
# output UPDATED ERG
write.csv(Fulltext_ERG, "Fulltext_ERG_20200305.csv")
write.csv(url_ERG, "Links_ERG_20200305.csv")
###################################
# Download TA history file
# convert html to pdf
###################################
# ref: https://rdrr.io/cran/pagedown/man/chrome_print.html
# install.packages("pagedown")
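# Usage sketch: chrome_print() renders a URL in headless Chrome and saves it as
# a PDF (requires a local Chrome/Chromium installation), e.g.:
# pagedown::chrome_print("https://www.nice.org.uk/guidance/ta1/history",
#                        output = "TA1_history.pdf")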
# create url
url <- paste("https://www.nice.org.uk/guidance/ta", TA_vector, "/history", sep = "")
file_name <- paste("TA", TA_vector, "_history.pdf", sep = "")
history <- data.frame (TA = TA_vector,
Status = rep(NA, length(TA_vector)))
indices <- 1:length(TA_vector)
download_history <- function(url, file_name, indices){
for(n in c(indices)){
tryCatch({
# read html
chrome_print(url[n], output = file_name[n])
history$Status[n] <- "downloaded"
}, error = function(e){
cat("ERROR :",conditionMessage(e), "\n")
})
# random sleeping time
sleepy = sample(c(0.5:3), 1)
cat("\n let's just wait for",sleepy,"seconds...")
Sys.sleep(sleepy) # website will ban IP when too many queries are sent too quickly...
}
return(history)
}
# download all history
history <- download_history(url = url, file_name = file_name, indices = indices) # capture the returned status table
history[is.na(history$Status), "Status"] <- "NA"
history[history$Status == "NA", "TA"]
# [1] 556 550 547 507 435 434 431 362 359 353 351 350 169 167 161 34 20 10 1
# output the indices where files were not downloaded and try again
indices <- which(grepl("NA", history$Status))
history <- download_history(url = url, file_name = file_name, indices = indices)
history[history$Status == "NA", "TA"]
# [1] 556 434 431 167 161
# manually download these files, then mark them as downloaded
history[history$Status == "NA", "Status"] <- "downloaded"
# save file
write.csv(history, "History_20200303.csv")